Contributors: Alec Jacobs & Richard Barad | 12.15.2023
Course: Public Policy Analytics, Master of Urban Spatial Analytics
For an abbreviated, high-level summary of this analysis, please watch our video published on YouTube.
The City of Chicago is the third most populous metropolitan area in the United States, boasting a robust culinary landscape with over 15,000 restaurants. Each of the city’s restaurants is subject to recurring inspections conducted under the purview of the Department of Public Health’s Division of Food Protection. The task of inspection falls to only 36 accredited sanitarians, leaving each responsible for roughly 470 of the 15,000 establishments. This resource disparity presents a notable challenge to efficient and effective food safety monitoring.
To address this resource disparity, Chicago turned to data, partnering the city’s Department of Public Health with the Advanced Analytics Team housed in the Department of Innovation and Technology. Together, they developed a predictive model aimed at identifying restaurants most likely to have critical food safety violations. The existing model combines historical data from roughly 100,000 past sanitation inspections with business characteristics and 311 complaints. The model enables inspectors to prioritize their inspections, concentrating on the establishments at greatest risk of failing future food inspections.
The model’s implementation has yielded positive results for the city of Chicago. Food establishments with critical violations are now more likely to be identified earlier, leading to faster interventions and, ultimately, higher-quality and more widely available dining options for Chicago residents. Adopting a more “science-based” initiative highlights the potential of data analytics to enhance public health and economic outcomes, particularly in city governments with limited resources.
This project is based on a hypothetical scenario in which the City of Chicago has launched a program through its Department of Public Health and Office of Food Protection, allocating money to a Local Restaurant Improvement Fund. The city aims to direct funding to establishments facing an elevated probability of non-compliance with food safety standards set by the Food Protection Division. Restaurants that receive this funding would be reminded of the importance of complying with public health guidelines and passing food inspections. As mentioned, all retail food establishments throughout Chicago are subject to recurring yearly inspections. If a restaurant that has received funding is found to be in violation of public health guidelines, it would no longer be eligible for financial support through the program.
Building on the city’s previous work in predictive modeling, we have developed a supplementary forecasting tool that predicts a restaurant’s likelihood of failing a food inspection during the upcoming year. Our model uses logistic regression to provide a one-year forecast of that probability. We train our model on restaurant inspection data for 2021, dividing the dataset into a training set, used to build the model, and a test set, used to examine the accuracy of its predictions. This undertaking represents a departure from the conventional procedures common throughout the nation in favor of the “science-based” methodology employed by the city. However, unlike the city’s current use of its predictive model, ours specifically informs the deployment of funds contingent on passing health inspections. We harness predictive analytics to identify restaurants with an increased susceptibility to regulatory infractions. Our objective is clear: to maximize the impact of the allocated financial resources by strategically directing them toward the establishments deemed most at risk of failing a food inspection during the upcoming year.
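In broad strokes, that workflow can be sketched in miniature. This is an illustrative sketch only - the data frame `inspections`, its two predictors, and the 65/35 split below are hypothetical placeholders, not the specification built later in this report:

```r
set.seed(414)
#Hypothetical inspection data: a binary fail outcome plus two toy predictors
inspections <- data.frame(
  fail = rbinom(200, 1, 0.25),
  age = runif(200, 0, 4380),
  prev_fails = rpois(200, 1)
)
#Split into a training set (to build the model) and a test set (to check accuracy)
train_idx <- sample(nrow(inspections), size = 0.65 * nrow(inspections))
train <- inspections[train_idx, ]
test <- inspections[-train_idx, ]
#Logistic regression: predicted probability of failing an inspection
model <- glm(fail ~ age + prev_fails, data = train, family = binomial(link = "logit"))
test$prob_fail <- predict(model, newdata = test, type = "response")
```

The predicted probabilities in `prob_fail` are what would drive the ranking of restaurants for funding.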
The selected restaurants are set to receive financial support, benefiting not only monetarily but also from an educational discourse on the importance of passing food inspections. Eligibility for funding hinges on adhering to food quality standards year after year, reinforcing each restaurant’s commitment to maintaining quality. If, however, an establishment falls short of the city’s benchmarks, its financial support from the city is halted - a firm reminder of the health standards in place. Our proposed initiative thus unfolds as a structured and strategic intervention, a proactive mechanism for fostering restaurant quality and elevating the overall quality of dining throughout the city of Chicago.
To support the coordinated efforts of the restaurant improvement program informed by our predictive modeling, we propose a web-based application to share the model outputs. The key features we envision include a heat map showing where restaurants are predicted to fail inspections, a list of predictors along with corresponding information for each restaurant, a probability or confidence level toggle, and supporting search and export features. The web-based application would also be translated into a smartphone format – one that may be more accessible to stakeholders beyond the city decision-makers responsible for the Local Restaurant Improvement Fund. The smartphone format would contain the same elements as the browser-based application and provide clear direction concerning the funding program.
Upon thoroughly validating and testing our predictive model, we intend to proceed with the implementation stage. This stage would involve operationalizing the web-based application with support from a software developer to ensure a seamless and scalable launch. We will first run the model to develop predictions for the upcoming year, conduct a preliminary beta test with the city of Chicago to gather real-world feedback, and then prepare a full rollout strategy once initial testing is complete. Developing an effective rollout strategy is crucial for the app’s deployment, and we remain committed to continually refining the application based on insights gathered through the inspection data used in our predictive forecasting model.
We begin building our predictive model by importing food inspection data from the city of Chicago, covering all records since our analysis draws on historical failures. The data originate from assessments of restaurants and food establishments in Chicago conducted from January 1, 2010 to the present by personnel from the Chicago Department of Public Health’s Food Protection Program. The results are entered into a database for approval by a State of Illinois Licensed Environmental Health Practitioner. After importing the inspection data, we clean and standardize the business names and facility types. This selectiveness ensures the predictive model is built on the types of establishments (e.g., restaurants, bakeries, and coffee shops) we are targeting for the Local Restaurant Improvement Fund.
#Load the libraries used throughout this analysis
library(RSocrata)   #read.socrata
library(tidyverse)  #dplyr, stringr, tidyr, ggplot2
library(lubridate)  #ymd, year, month
library(sf)         #spatial data handling
library(tidycensus) #get_acs
library(matchmaker) #match_vec
library(gridExtra)  #grid.arrange
#Read data from the city of Chicago on inspections - we need all records since we look at historical failures
data <- read.socrata("https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5") %>%
na.omit() %>%
mutate(inspection_date = ymd(inspection_date),
year = year(inspection_date),
month = month(inspection_date),
address = str_squish(address),
facility_type = str_squish(str_to_title(facility_type)))
#Clean up messy facility type names so we can identify the facility types of interest (e.g. restaurants, cafes, and bakeries)
data$facility_type[grepl("Bakery", data$facility_type)] <- "Bakery"
data$facility_type[grepl("Coffee", data$facility_type)] <- "Coffee Shop"
data$facility_type[grepl("Ice Cream", data$facility_type)] <- "Ice Cream Shop"
data$facility_type[grepl("Deli", data$facility_type)] <- "Deli"
data$facility_type[grepl("Taqueria", data$facility_type)] <- "Restaurant"
data$facility_type[grepl("Hot Dog Station", data$facility_type)] <- "Restaurant"
data$facility_type[grepl("Juice and Salad Bar", data$facility_type)] <- "Restaurant"
data$facility_type[grepl("Restaurant", data$facility_type)] <- "Restaurant"
#Standardize some names of businesses
data$dba_name[grepl("SEE THRU CHINESE", data$dba_name)] <- 'SEE THRU CHINESE RESTAURANT'
data$dba_name[grepl("JAMAICAN GATES", data$dba_name)] <- 'JAMAICAN GATES'
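The facility-type recodes above can equivalently be expressed as a small rule table, which is easier to extend. This is a sketch on toy values - the inline version above is what the analysis actually runs:

```r
#Toy facility-type vector; the real analysis applies these rules to data$facility_type
facility_type <- c("French Bakery", "Coffee House", "Taqueria Uno", "Restaurant Bar")
#Regex pattern -> standardized label; order matters, so the broad Restaurant rule runs last
recode_rules <- c(
  "Bakery" = "Bakery",
  "Coffee" = "Coffee Shop",
  "Ice Cream" = "Ice Cream Shop",
  "Deli" = "Deli",
  "Taqueria|Hot Dog Station|Juice and Salad Bar|Restaurant" = "Restaurant"
)
for (pattern in names(recode_rules)) {
  facility_type[grepl(pattern, facility_type)] <- recode_rules[[pattern]]
}
#facility_type is now c("Bakery", "Coffee Shop", "Restaurant", "Restaurant")
```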
We download business licenses from the Chicago Open Data Portal. These licenses are issued by the City of Chicago’s Department of Business Affairs and Consumer Protection, covering the period from 2002 to the present. Next, we import and clean the business license data. The retrieval encompasses all licenses issued since 2010, the year our analysis starts, given that our inspection data does not extend back further than 2010. The business license data will be used to estimate the age of each restaurant.
# Download Business Licenses From the Chicago Data Portal
liscenses <- read.socrata("https://data.cityofchicago.org/resource/r5kz-chrr.json?$where=license_start_date%20between%20%272010-01-01T12:00:00%27%20and%20%272021-12-31T14:00:00%27")
# Manual Cleaning Of Addresses
corrections <- data.frame(
bad = c("4623-4627 N BROADWAY 1 & 2","100 E WALTON ST 1 104","436-440 E 79TH ST","1733 W 87TH ST 1ST FLOOR","163 E WALTON ST 2ND F","5640 S UNIVERSITY AVE","5255 W MADISON ST 1 B","111 E 51ST ST 1ST`"),
good = c("4623-4627 N BROADWAY", "100 E WALTON ST","436 - 440 E 79TH ST","1733 W 87TH ST","163 E WALTON ST","5640 S UNIVERSITY","5255 W MADISON ST","111 E 51ST ST 1ST"),
stringsAsFactors = FALSE
)
liscenses$address <- match_vec(x=liscenses$address,corrections,from=1,to=2, quiet=TRUE)
liscenses$address[grepl("47 W POLK ST", liscenses$address)] <- "47 W POLK ST"
liscenses$address <- gsub("\\s*\\d+$", "", liscenses$address) #Remove trailing unit numbers (e.g. " 104")
liscenses$address <- gsub("\\s*\\d+(ST)?$", "", liscenses$address) #Remove trailing numbers optionally followed by "ST" (e.g. " 1ST")
liscenses$address <- gsub("\\s*\\d+(st)?$", "", liscenses$address) #Likewise for lowercase "st"
#Standardize some business names
liscenses$doing_business_as_name[grepl("SEE THRU CHINESE", liscenses$doing_business_as_name)] <- 'SEE THRU CHINESE RESTAURANT'
liscenses$doing_business_as_name[grepl("THE NEW VALOIS REST", liscenses$doing_business_as_name)] <- 'THE NEW VALOIS REST INC'
liscenses$doing_business_as_name[grepl("CHICAGO MARRIOTT DOWNTOWN", liscenses$doing_business_as_name)] <- 'CHICAGO DOWNTOWN MARRIOTT'
liscenses$doing_business_as_name[grepl("STAR OF SIAM", liscenses$doing_business_as_name)] <- 'STAR OF SIAM'
liscenses$doing_business_as_name[grepl("EL CHILE RESTAURANT & PANCAKE HOUSE.", liscenses$doing_business_as_name)] <- 'EL CHILE RESTAURANT & PANCAKE HOUSE'
liscenses$doing_business_as_name[grepl("FRANCES' DELI & BRUNCHERY", liscenses$doing_business_as_name)] <- "FRANCES' REST & BRUNCHERY"
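As a sanity check, the address-cleaning passes can be demonstrated on a few illustrative addresses (made up for this example, not drawn from the license data): trailing unit numbers such as " 104" and " 1ST" are stripped, while street-type suffixes like "ST" survive.

```r
example <- c("100 E WALTON ST 1 104", "111 E 51ST ST 1ST", "47 W POLK ST")
example <- gsub("\\s*\\d+$", "", example)         #strip trailing plain numbers
example <- gsub("\\s*\\d+(ST|st)?$", "", example) #strip trailing numbers with an ST/st suffix
#example is now c("100 E WALTON ST", "111 E 51ST ST", "47 W POLK ST")
```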
We collect the spatial data for the neighborhoods and the overall city boundary of Chicago, preparing for our spatial analysis and upcoming visualizations. The boundaries of both the city and its neighborhoods are obtained from the Chicago Data Portal. The neighborhood boundaries layer is refined to include only the neighborhoods and geometry columns, and both the neighborhoods and the city boundary are transformed to a common spatial reference system for compatibility with the other datasets that will be used in our analysis.
neighboorhoods <- st_read('Data/neighboorhoods.shp')%>%
st_transform('ESRI:102271') %>%
dplyr::select(pri_neigh)
chicagoboundary <-
st_read("https://data.cityofchicago.org/api/geospatial/ewy2-6yfk?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271')
We download census data by block group for Chicago from the 2021 American Community Survey for the variables listed in the code below.
The dataset is cleaned and the geometries are transformed into the appropriate projected coordinate system. We use the household variables to calculate the percent of households in a census block group that are below the poverty line. The percent poverty and median rent variables will be included as predictors in our model. Additionally, we calculate the percent of the population in each block group that is white. This information is not included in our model, but will be used later on to examine whether model accuracy varies between majority white and majority non-white neighborhoods.
variables = c("B17017_002", #Total Households w/ income below poverty line
"B17017_001", #Total Households
"B02001_001", #Total Population
"B02001_002", #Total White Population
"B25058_001") #Median Rent
census_data <- get_acs("block group",
year=2021,
output='wide',
geometry=T,
variables = variables,
state = 'IL',
county = 'Cook'
) %>%
st_transform('ESRI:102271') %>%
select(ends_with('E'),'GEOID') %>%
rename(poverty = "B17017_001E",
below_poverty_line = "B17017_002E",
Total_Population = "B02001_001E",
White_Population = "B02001_002E",
Median_Rent = "B25058_001E") %>%
mutate(Median_Rent = ifelse(is.na(Median_Rent),median(Median_Rent,na.rm=TRUE),Median_Rent),
pct_non_white = ifelse(Total_Population == 0,0,100 - White_Population / Total_Population * 100), #Treat unpopulated block groups as 0 rather than 100 percent non-white
pct_poverty = ifelse(poverty == 0,0, below_poverty_line / poverty * 100)) %>%
select('Median_Rent','pct_non_white','pct_poverty','GEOID')
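The division-by-zero guards in the mutate() above can be checked on toy values (hypothetical block groups): a block group with zero households gets a poverty rate of 0 rather than NaN.

```r
#Two hypothetical block groups: one with 200 households, one with none
households <- c(200, 0)
below_poverty <- c(50, 0)
pct_poverty <- ifelse(households == 0, 0, below_poverty / households * 100)
#pct_poverty is 25 for the populated block group and 0 (not NaN) for the empty one
```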
The first step in building the dataset used to train our model involves filtering and cleaning the inspection data. This process involves projecting the restaurant data into the appropriate coordinate system and selecting just the facility types eligible for funding through the restaurant improvement program. We also exclude inspections conducted in the neighborhood containing Chicago O’Hare International Airport and inspections of restaurants located inside Midway International Airport. We treat results labeled “Pass w/ Conditions” as equivalent to a regular “Pass,” and conduct a spatial join between the restaurants and the neighborhood layer.
We build two clean datasets, one for 2021 and another for 2020. The 2021 dataset will be used for building our model, while the 2020 dataset will be used to engineer features describing restaurant inspection pass rates in the previous year.
filter <- c("Restaurant","Bakery","Tavern","Ice Cream Shop","Deli","Cafe","Coffee Shop","")
#Function to clean inspection data, join to neighboorhoods and filter to just the year of interest
clean_inspections <- function(df,y){
clean_df <- df %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102271') %>%
dplyr::filter(year == y & results != 'Out of Business' & results != 'No Entry' & results != 'Not Ready' & results != 'Business Not Located') %>%
dplyr::filter(facility_type %in% filter) %>%
mutate(results = ifelse(results=="Pass w/ Conditions","Pass",results),
fail = ifelse(results=="Fail",1,0)) %>%
st_join(.,neighboorhoods,join = st_intersects) %>%
mutate(pri_neigh = ifelse(location == "(42.008536400868735, -87.91442843927047)","O'Hare",pri_neigh),
pri_neigh = ifelse(location %in% c("(41.892249163400116, -87.60951804879336)","(41.89233780863412, -87.6040447589981)"),"Streeterville",pri_neigh)) %>%
dplyr::filter(pri_neigh != "O'Hare") %>% # Remove Ohare Restaurants
dplyr::filter(address != '5700 S CICERO AVE') # Remove Midway Restaurants
return(clean_df)
}
data_2021 <- clean_inspections(data,2021)
data_2020 <- clean_inspections(data,2020)
This code estimates the age of each business by referencing the date of its initial business license application. For each retail establishment, it identifies the date when a license was first obtained. The minimum date is then joined back to the original license data and a subset of relevant columns is selected. The resulting age is estimated in days, with a maximum expected age of around 4,380 days (i.e., 365 days x 12 years).
To enhance accuracy, our code attempts several join strategies when linking the business licenses to the inspection data, as joining directly on the license ID did not consistently yield matches. The time frame considered spans 12 years, reflecting the available license data from 2010 onward. This ensures the analysis focuses on the relevant period up to the year 2021. For roughly 6% of 2021 inspections it was not possible to determine the age of the restaurant - we set the age of these restaurants equal to the citywide median. The map below shows the results of our age analysis for restaurants included in the 2021 inspection dataset.
#For each retail establishment, determine the date when a license was first obtained
liscense_min <- liscenses %>% group_by(doing_business_as_name,address) %>% summarize(min_date = min(license_start_date)) %>%
arrange(min_date)
#Join min date back to original licence data and select subset of columns
liscenses_final <- left_join(liscenses, liscense_min , by = c('doing_business_as_name','address')) %>%
select(license_id, account_number,legal_name,doing_business_as_name,address,site_number,min_date) %>%
mutate(license_id = as.integer(license_id))
#Join the date each business first applied for a license to the 2021 inspection data - try joining several different ways since joining on license number did not always match
data_2021_2 <- left_join(data_2021,liscenses_final %>% select(license_id,min_date),by=join_by(license_==license_id)) %>%
rename(min_date1 = 'min_date') %>%
left_join(.,liscense_min %>% select(doing_business_as_name,address,min_date),by=join_by(dba_name==doing_business_as_name,address==address),multiple='first') %>%
rename(min_date2 = 'min_date') %>%
left_join(.,liscense_min %>% select(doing_business_as_name,address,min_date),by=join_by(aka_name==doing_business_as_name,address==address),multiple='first') %>%
rename(min_date3 = 'min_date') %>%
mutate(min_date = pmin(min_date1, min_date2, min_date3, na.rm = TRUE), #Select lowest date where multiple joins worked
age = as.integer(difftime(inspection_date, min_date, units = 'days'))) %>%
select(-min_date1,-min_date2,-min_date3)
data_2021_2 <- data_2021_2 %>%
mutate(age = ifelse(age<0,0,age),
age = ifelse(is.na(age),median(age,na.rm=TRUE),age))
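The pmin() coalescing step above can be illustrated on made-up dates (purely hypothetical values, chosen to show the mechanics):

```r
#Three candidate first-license dates per restaurant, one per join strategy
min_date1 <- as.Date(c("2015-03-01", NA))
min_date2 <- as.Date(c("2016-01-01", NA))
min_date3 <- as.Date(c(NA, "2012-06-15"))
#Element-wise earliest non-missing date across the three joins
min_date <- pmin(min_date1, min_date2, min_date3, na.rm = TRUE)
#Age in days relative to a hypothetical inspection date
age <- as.integer(difftime(as.Date("2021-07-01"), min_date, units = "days"))
```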
ggplot() +
geom_sf(data = neighboorhoods, color = "white", fill = "grey80") +
geom_sf(data = data_2021_2, aes(color = age), size = 0.5) +
scale_color_viridis_c(name = "Age (Days)") +
labs(title = "Restaurant Age (Days)") +
theme_void()
This code estimates the number of previous violations for each business in the dataset. It begins by filtering the data to inspections from before 2021 whose outcome is ‘Fail’. We then group the dataset by business name and address and count the violations. The results are renamed and joined to the 2021 inspection data. The purpose of this code is to provide an overview of the historical violation record associated with each business, helping us understand compliance history throughout the city of Chicago.
We also use the age variable and the previous-failures variable to calculate the number of failures per day open. This variable serves as an interaction term that considers both the age of the restaurant and its previous number of failures. The map below shows the results of our previous inspection failure analysis for all restaurants in the 2021 inspection dataset.
fails <- data %>%
st_drop_geometry() %>%
dplyr::filter(year<2021 & results == 'Fail') %>%
group_by(dba_name,address) %>% tally() %>%
ungroup() %>%
rename(prev_fails = 'n')
data_2021_2 <- left_join(data_2021_2,fails,by=join_by(dba_name==dba_name,address==address)) %>%
mutate(prev_fails = replace_na(prev_fails,0),
fails_Per_day = replace_na(prev_fails/age,0),
fails_Per_day = ifelse(is.infinite(fails_Per_day), prev_fails, fails_Per_day))
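A quick edge-case check of the rate calculation (toy values): when a restaurant's estimated age is 0 days, dividing the failure count by age yields Inf, so the code falls back to the raw failure count.

```r
prev_fails <- c(3, 2, 0)
age <- c(4, 0, 150) #age in days; the 0 triggers the infinite-rate fallback
fails_per_day <- prev_fails / age
fails_per_day <- ifelse(is.infinite(fails_per_day), prev_fails, fails_per_day)
#fails_per_day is now c(0.75, 2, 0)
```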
ggplot() +
geom_sf(data = neighboorhoods, color = "white", fill = "grey80") +
geom_sf(data = data_2021_2, aes(color = cut(prev_fails, breaks = c(-1, 0, 1, 2, 5, max(prev_fails, na.rm=TRUE)))), size = 0.3)+
scale_color_viridis_d(name = "# of Failures", labels = c('0','1','2','3-5', '>5'))+
labs(title = "Number of Previous Inspection Failures") +
theme_void()
Here, we integrate census data into the 2021 restaurant inspection dataset through a spatial join, associating each restaurant record with its corresponding census block group information. A neighborhood-level calculation is performed to determine the mean values of percent poverty, median rent, and percent non-white by neighborhood. Restaurants that lack a census block group association are assigned the mean value of the neighborhood in which the restaurant is located. In the end, each restaurant entry is enriched with census socioeconomic information, producing a more detailed picture of the demographic conditions surrounding each restaurant in the 2021 inspection dataset used to build our model.
The maps below show the census information for the restaurants in the 2021 inspection dataset.
data_2021_2 <- data_2021_2 %>%
st_join(.,census_data,join = st_intersects)
means_neigh <- data_2021_2 %>%
st_drop_geometry() %>%
group_by(pri_neigh) %>% summarize_at(vars("pct_poverty","Median_Rent","pct_non_white"),mean,na.rm=TRUE)
data_2021_2 <- left_join(data_2021_2,means_neigh,by='pri_neigh') %>%
mutate(pct_poverty = ifelse(is.na(pct_poverty.x),pct_poverty.y,pct_poverty.x),
Median_Rent = ifelse(is.na(Median_Rent.x),Median_Rent.y,Median_Rent.x),
pct_non_white = ifelse(is.na(pct_non_white.x),pct_non_white.y,pct_non_white.x)) %>%
select(-ends_with('.x'),-ends_with('.y'))
grid.arrange(ncol=2,
ggplot() +
geom_sf(data = neighboorhoods, color = "white", fill = "grey80") +
geom_sf(data = data_2021_2, aes(color = pct_poverty), size = 0.3) +
scale_color_viridis_c(name = "Percent Poverty")+
labs(title = "Percent of Households Below Poverty Line") +
theme_void(),
ggplot() +
geom_sf(data = neighboorhoods, color = "white", fill = "grey80") +
geom_sf(data = data_2021_2, aes(color = Median_Rent), size = 0.3) +
scale_color_viridis_c(name = "Median Rent")+
labs(title = "Median Rent in USD") +
theme_void()
)
This map illustrates the outcomes of food inspections in 2021, which will serve as the training data for our model. Notably, a concentration of inspection points is observed in and around the Loop, as well as along traffic corridors. The spatial distribution reveals clear clustering in areas where inspection failures are prevalent.
custom_colors <- c("Pass" = "#41B6E6", "Fail" = "#E4002B")
ggplot() +
geom_sf(data = neighboorhoods) +
geom_sf(data = data_2021, aes(color = results), size = 0.3) +
scale_color_manual(values = custom_colors) +
theme_void()